Testing Closeness With Unequal Sized Samples
Authors
Abstract
We consider the problem of testing whether two unequal-sized samples were drawn from identical distributions, versus distributions that differ significantly. Specifically, given a target error parameter ε > 0, m1 independent draws from an unknown distribution p with discrete support, and m2 draws from an unknown distribution q with discrete support, we describe a test for distinguishing the case that p = q from the case that ‖p − q‖₁ ≥ ε. If p and q are supported on at most n elements, then our test is successful with high probability provided m1 ≥ n^(2/3)/ε^(4/3) and m2 = Ω(max{ n/(√m1 · ε²), √n/ε² }). We show that this tradeoff is information-theoretically optimal throughout this range in the dependencies on all parameters, n, m1, and ε, to constant factors for worst-case distributions. As a consequence, we obtain an algorithm for estimating the mixing time of a Markov chain on n states up to a log n factor that uses Õ(n·τ_mix) queries to a “next node” oracle. The core of our testing algorithm is a relatively simple statistic that seems to perform well in practice, both on synthetic and on natural language data. We believe that this statistic might prove to be a useful primitive within larger machine learning and natural language processing systems.
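To get a feel for the tradeoff stated in the abstract, the following minimal Python sketch evaluates the expression inside the Ω(·) bound on m2, namely max{ n/(√m1 · ε²), √n/ε² }. The function name `m2_scaling` is hypothetical, and the hidden constant factor is simply omitted, so the returned value indicates only how the requirement scales, not an actual sample size that guarantees success.

```python
import math


def m2_scaling(n: int, m1: float, eps: float) -> float:
    """Evaluate max{ n / (sqrt(m1) * eps^2), sqrt(n) / eps^2 }.

    Illustrative only: this is the quantity inside the Omega(.) bound
    from the abstract with the unknown constant factor dropped.
    """
    if n <= 0 or m1 <= 0 or eps <= 0:
        raise ValueError("n, m1 and eps must all be positive")
    first_term = n / (math.sqrt(m1) * eps ** 2)   # shrinks as m1 grows
    second_term = math.sqrt(n) / eps ** 2         # floor independent of m1
    return max(first_term, second_term)


if __name__ == "__main__":
    # Larger m1 lowers the m2 requirement until the sqrt(n)/eps^2 floor is hit.
    for m1 in (400, 1_000, 10_000):
        print(m1, m2_scaling(1_000, m1, 0.5))
```

The two terms coincide when m1 = n, so under this formula drawing more than about n samples from p no longer reduces the number of samples needed from q.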
Similar resources
Mathematical Programming Models for Solving Unequal-Sized Facilities Layout Problems - a Generic Search Method
This paper presents unequal-sized facilities layout solutions generated by a genetic search program named LADEGA (Layout Design using a Genetic Algorithm). The generalized quadratic assignment problem requiring pre-determined distance and material flow matrices as the input data and the continuous plane model employing a dynamic distance measure and a material flow matrix are discussed. Computa...
Competitive Classification and Closeness Testing
We study the problems of classification and closeness testing. A classifier associates a test sequence with the one of two training sequences that was generated by the same distribution. A closeness test determines whether two sequences were generated by the same or by different distributions. For both problems all natural algorithms are symmetric—they make the same decision under all symbol re...
Differentially Private Testing of Identity and Closeness of Discrete Distributions
We study the fundamental problems of identity testing (goodness of fit), and closeness testing (two sample test) of distributions over k elements, under differential privacy. While the problems have a long history in statistics, finite sample bounds for these problems have only been established recently. In this work, we derive upper and lower bounds on the sample complexity of both the problem...
Differentially Private Identity and Closeness Testing of Discrete Distributions
We investigate the problems of identity and closeness testing over a discrete population from random samples. Our goal is to develop efficient testers while guaranteeing Differential Privacy to the individuals of the population. We describe an approach that yields sample-efficient differentially private testers for these problems. Our theoretical results show that there exist private identity a...
Pitman-Closeness of Preliminary Test and Some Classical Estimators Based on Records from Two-Parameter Exponential Distribution
In this paper, we study the performance of estimators of the parameters of the two-parameter exponential distribution based on upper records. The generalized likelihood ratio (GLR) test was used to generate a preliminary test estimator (PTE) for both parameters. We have compared the proposed estimator with maximum likelihood (ML) and unbiased estimators (UE) under mean-squared error (MSE) and Pitman me...